Topic modeling visualization using noun phrases from ds_PWDB

We will follow the gensim topic modeling workflow to build a topic model based on the Latent Dirichlet Allocation (LDA) algorithm, using noun phrases from the ds_PWDB dataset, and explore strategies to visualize the results effectively with the plotly package.

Import libraries

Download dataset and visualize it

Prepare the data

After importing and visualizing the dataset table, we identify the columns that contain textual data and concatenate them into a single text to work with.
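The concatenation step can be sketched in plain Python as follows. The rows and the column names (`title`, `background`, `content`) are hypothetical placeholders, not the actual ds_PWDB schema; in the notebook the same idea would be applied to the columns of the loaded table.

```python
# Hypothetical rows standing in for the ds_PWDB table.
# The column names below are assumptions made for illustration only.
rows = [
    {"title": "Short-time work scheme", "background": "Support for employers", "content": None},
    {"title": "Tax deferral", "background": "", "content": "Deferred VAT payments"},
]
TEXT_COLUMNS = ["title", "background", "content"]


def concatenate_text(row, columns=TEXT_COLUMNS):
    """Join the non-empty textual fields of one row into a single string."""
    # Missing (None) and empty values are skipped so they do not add stray spaces.
    parts = [str(row[col]).strip() for col in columns if row.get(col)]
    return " ".join(parts)


# One concatenated text per row of the table.
documents = [concatenate_text(row) for row in rows]
```

Skipping empty fields before joining keeps the resulting text free of "None" strings and doubled spaces, which would otherwise have to be removed in the cleaning step.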

Clean the data

After extracting the data, one of the most important steps is cleaning it. We will delete the characters below from our text to make it cleaner.
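A minimal sketch of such a cleaning function is shown here. The set of characters in `UNWANTED` is an assumption chosen for illustration; it should be replaced with the list actually used in the notebook.

```python
import re

# Assumed set of unwanted characters (line breaks, tabs, brackets, symbols).
# This is an illustrative choice, not the notebook's actual list.
UNWANTED = re.compile(r"[\r\n\t()\[\]{}<>*#@|]")
MULTISPACE = re.compile(r"\s{2,}")


def clean_text(text):
    """Replace unwanted characters with spaces, then collapse repeated whitespace."""
    no_unwanted = UNWANTED.sub(" ", text)
    return MULTISPACE.sub(" ", no_unwanted).strip()


cleaned = clean_text("Income\tsupport (temporary)\n\n[2020]")
```

Replacing with a space rather than an empty string avoids gluing neighbouring words together when a removed character was the only separator between them.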

Extract noun phrases

Let's visualize the noun phrases that we found

Build the topic model

To build the LDA topic model we will use the WordsModeling class, which contains the corpus and the dictionary created from the keywords and weights of the detected noun phrases.
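To make the terms "dictionary" and "corpus" concrete, here is a plain-Python sketch of the two data structures. It is not the WordsModeling implementation, and the noun phrases are made up. It only mirrors the shape that gensim works with: a mapping from token to integer id, and one bag-of-words list of `(token_id, count)` pairs per document. The way ids are assigned here is an assumption and may differ from gensim's.

```python
from collections import Counter

# Hypothetical noun phrases, already split into tokens.
noun_phrases = [
    ["income", "support"],
    ["wage", "subsidy", "scheme"],
    ["income", "tax", "deferral"],
]


def build_dictionary(token_lists):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token_to_id = {}
    for tokens in token_lists:
        for token in tokens:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def to_bag_of_words(tokens, token_to_id):
    """Turn one token list into sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token_to_id[t] for t in tokens if t in token_to_id)
    return sorted(counts.items())


dictionary = build_dictionary(noun_phrases)
corpus = [to_bag_of_words(tokens, dictionary) for tokens in noun_phrases]
```

The model only ever sees the integer ids and counts; the dictionary is what maps topic keywords back to readable tokens afterwards.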

LDA training is executed using LdaMulticore from the gensim library, which creates 10 different topics based on our noun phrases. Here are the keywords from a randomly chosen topic:

LDA Visualization

In LDA models, each document is composed of multiple topics. Here we determine which topic each document predominantly belongs to. The table has 5 columns and contains the data from the LDA training on the extracted noun phrases. Each noun phrase is treated as a document, and each topic is represented by a list of keywords. The Topic Percentage Contribution column shows how strongly the dominant topic contributes to each noun phrase.
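The dominant topic of a document can be found by taking the topic with the highest weight in that document's topic distribution. The sketch below assumes each distribution is a list of `(topic_id, probability)` pairs, which is the shape gensim returns for a single document. The values and the field names of the resulting rows are made up for illustration.

```python
# Hypothetical per-document topic distributions as (topic_id, probability) pairs.
doc_topics = [
    [(0, 0.10), (6, 0.75), (3, 0.15)],
    [(2, 0.55), (6, 0.45)],
    [(6, 0.90), (1, 0.10)],
]


def dominant_topic(distribution):
    """Return the (topic_id, weight) pair with the largest weight."""
    topic_id, weight = max(distribution, key=lambda pair: pair[1])
    return topic_id, round(weight, 4)


# One row per document: its index, its dominant topic and that topic's contribution.
table = []
for doc_index, distribution in enumerate(doc_topics):
    topic_id, weight = dominant_topic(distribution)
    table.append({"document": doc_index, "dominant_topic": topic_id, "contribution": weight})
```

Keeping the contribution next to the dominant topic matters because two documents assigned to the same topic can differ a lot in how strongly they belong to it.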

Here we use the plotly package to create a bar chart that visualizes the information in the table above.

In the first chart we can see the number of documents in each topic and how predominant the 7th topic is compared to the others.

In the second chart we see the topics again, this time by contribution. Although the 7th topic contains most of the representative text, it has the lowest percentage contribution compared to the others.

Based on the information that we have, let's view the topics that contribute the most.

Let’s compute the total number of documents attributed to each topic.
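A minimal sketch of this count, assuming we already have one dominant topic id per document. The list below is made up for illustration.

```python
from collections import Counter

# Hypothetical dominant topic id for each document.
dominant_topics = [6, 2, 6, 0, 6, 2]

# Number of documents attributed to each topic.
counts = Counter(dominant_topics)
total = len(dominant_topics)

# (topic_id, document_count, share_of_all_documents), ordered by topic id.
summary = [(topic, n, round(n / total, 4)) for topic, n in sorted(counts.items())]
```

The share column makes topics comparable across datasets of different sizes, which the raw counts alone do not.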

Here is the distribution of dominant topics in each document

Total topic distribution by actual weight
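Unlike the dominant-topic count, the weight-based view sums every topic's probability over all documents, so a topic also gets credit for documents where it is present but not dominant. The sketch below uses made-up `(topic_id, probability)` pairs as input.

```python
from collections import defaultdict

# Hypothetical per-document topic distributions as (topic_id, probability) pairs.
doc_topics = [
    [(0, 0.25), (6, 0.75)],
    [(2, 0.5), (6, 0.5)],
    [(6, 0.5), (1, 0.5)],
]


def topic_weight_totals(distributions):
    """Sum the weight of each topic across all documents."""
    totals = defaultdict(float)
    for distribution in distributions:
        for topic_id, weight in distribution:
            totals[topic_id] += weight
    return dict(totals)


totals = topic_weight_totals(doc_topics)
# Topic ids ordered from the largest to the smallest total weight.
ranking = sorted(totals, key=totals.get, reverse=True)
```

Comparing this ranking with the document counts shows whether a topic is large because it dominates a few documents or because it is spread thinly over many.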

Finally, pyLDAvis is the most commonly used tool for visualizing the information contained in a topic model, and a convenient one. Below is the implementation for LdaModel().